We investigate the adaptation and performance of modularity-based algorithms, designed in the scope of 
complex networks, to analyze the mesoscopic structure of correlation matrices. Using a multi-resolution 
analysis we are able to describe the structure of the data in terms of clusters at different topological levels. 
We demonstrate the applicability of our findings in two different scenarios: to analyze the neural connectivity 
of the nematode Caenorhabditis elegans, and to automatically classify a typical benchmark of unsupervised 
clustering, the Iris data set, with considerable success. 

PACS numbers: 89.75.Hc,89.75.Fb 

Keywords: Clustering, networks, community structure, multiple resolution, modularity. 






Salvador Dali's famous painting "Gala contemplating the Mediterranean sea which at twenty meters becomes a portrait of Abraham Lincoln" is perhaps the best proof of how a complex system reveals different information when observed at different scales (in this case, length scales). We proposed a method [1] to unveil the equivalent phenomenon in the description of complex networks from a topological perspective. By defining a parameter that controls the resistance of each node to belonging to a group, we are able to analyze the community structure of the network at different topological scales. We apply the method to the exploratory analysis of the structural connectivity of the neuronal system of C. elegans and find a tentative classification of the functional activity of groups of neurons at certain topological scales. We have also tested the method on the automatic classification of a typical benchmark of unsupervised data clustering, the Iris dataset. These results pave the way for applying community detection algorithms from complex networks to the exploration and classification of real data sets.



I. INTRODUCTION 

Complex networks are graphs representative of the intricate connections between elements in many natural and artificial systems, whose description in terms of statistical properties has been largely developed in the search for a universal classification of them. However, when the networks are analyzed locally, some characteristics that are partially hidden in the statistical description emerge. Perhaps the most relevant is the discovery in many of them of community structure, meaning the existence of densely (or strongly) connected groups of nodes, with sparse (or weak) connections between them.

The study of the community structure helps to elucidate the organization of the networks and, eventually, could be related to the functionality of groups of nodes. The most successful solutions to the community detection problem, in terms of accuracy, are those based on the optimization of a quality function called modularity, proposed by Newman and Girvan, that allows the comparison of different partitions of the network. Given a network partitioned into communities, with $C_i$ the community to which node $i$ is assigned, the mathematical definition of modularity is expressed in terms of the weighted adjacency matrix $w_{ij}$, which represents the value of the weight of the link between nodes $i$ and $j$ (this weight is 0 if no link exists), and the strengths $w_i = \sum_j w_{ij}$, as

$$Q = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j), \tag{1}$$



a) Electronic mail: alex.arenas@urv.cat 



where the Kronecker delta function $\delta(C_i, C_j)$ takes the value 1 if nodes $i$ and $j$ are in the same community and 0 otherwise, and the total strength is $2w = \sum_i w_i$. The modularity of a given partition is then the probability of having edges falling within groups in the network minus the expected probability in an equivalent (null case) network with the same number of nodes, and edges placed at random preserving the nodes' strength. The larger the modularity, the better the partitioning, because it deviates more from the null case. Note that the optimization of the modularity cannot be performed by exhaustive search, since the number of different partitions is equal to the Bell or exponential numbers, which grow at least exponentially in the number of nodes $N$. Indeed, optimization of modularity is an NP-hard (Non-deterministic Polynomial-time hard) problem. Several authors have attacked the problem, with considerable success, by proposing different optimization heuristics; see Fortunato for a review.
Maximizing modularity one obtains the "best" partition of the network into communities. This partition represents an intermediate topological scale of organization, or mesoscale, that in many cases has been shown to coincide with known information about subdivisions in the network [7,18]. However, it has recently been pointed out that the optimization of the modularity has a characteristic scale related to the number of links in the network, which delimits the resolution beyond which no separation into smaller groups can be obtained when optimizing modularity, even though these smaller partitions, and hence different levels of description, plausibly exist from direct observation. The problem seems then to be that modularity, as it has been prescribed, does not have access to these other levels of description, and so its direct interpretation must be used cautiously. The reason is that the topological scale to which we have access by maximizing modularity has a topological resolution limit. The analogy with the observation of Dali's painting is clear: modularity is our tool to "observe" a complex network, and its limit is equivalent to a limit in the distance at which we observe the painting (Fig. 1). We proposed a method that allows the full screening of the topological structure at any resolution level using the original formulation and semantics of modularity, thereby overcoming the resolution limit. Our aim is to take advantage of this method to analyze real data sets in terms of clustering.

FIG. 1. "Gala contemplating the Mediterranean sea which at twenty meters becomes a portrait of Abraham Lincoln", by Salvador Dali, 1974. Left, at closer distance, and right, at larger distance.
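
As a concrete illustration of Eq. (1), the following minimal Python sketch evaluates the modularity of a given partition. It is not the authors' implementation: the function name `modularity`, the dense list-of-lists representation of $w_{ij}$ and the toy network (two triangles joined by a single link) are assumptions made only for this example.

```python
def modularity(weights, communities):
    """Modularity of Eq. (1) for a symmetric, non-negative weight matrix.

    weights     -- list of lists, weights[i][j] = w_ij (0 if no link)
    communities -- communities[i] = label C_i of node i
    """
    n = len(weights)
    strengths = [sum(row) for row in weights]   # w_i = sum_j w_ij
    total = sum(strengths)                      # 2w = sum_i w_i
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:    # Kronecker delta
                q += weights[i][j] - strengths[i] * strengths[j] / total
    return q / total


# Toy network (hypothetical): triangles {0,1,2} and {3,4,5} joined by link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1.0

q_two = modularity(w, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(w, [0] * 6)              # all nodes together
```

For this toy network the partition into the two triangles gives $Q = 5/14 \approx 0.357$, whereas placing all nodes in a single community gives $Q = 0$.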

The paper is structured as follows: In the next sec- 
tion we overview the multiple resolution method. Once 
the method has been presented, we propose its applica- 
tion for exploratory analysis in the topology of the neural 
network of the nematode C. elegans in section III, and 
its application to data clustering in section IV. Finally 
we present the conclusions of the work in section V. 



II. MULTIPLE RESOLUTION METHOD 

In this section we provide the necessary tools to extend 
the multiple resolution method to the most general case 
of networks with weighted signed directed links. 

A. General formulation of modularity 

The generalization of modularity to any network, with weighted, directed and signed values of the weights, is as follows. Let us suppose that we have a weighted undirected complex network with weights $w_{ij}$ as above. The relative strength $p_i$ of a node,

$$p_i = \frac{w_i}{2w}, \tag{2}$$

may be interpreted as the probability that this node makes links to other ones, if the network were random. This is precisely the approach taken by Newman and Girvan to define the modularity null case term, which reads

$$p_i p_j = \frac{w_i w_j}{(2w)^2}. \tag{3}$$

The introduction of negative weights destroys this probabilistic interpretation of $p_i$, since in this case the values of $p_i$ are not guaranteed to be between zero and one. The problem is the implicit hypothesis that there is only one unique probability to link nodes, which involves both positive and negative weights. To solve this problem, we have to introduce two different probabilities to form links, one for positive and the other for negative links.

Let us formalize this approach. First, we separate the positive and negative weights:

$$w_{ij} = w_{ij}^{+} - w_{ij}^{-}, \tag{4}$$

where we use the notation

$$w_{ij}^{+} = \max\{0, w_{ij}\}, \tag{5}$$

$$w_{ij}^{-} = \max\{0, -w_{ij}\}. \tag{6}$$

These expressions are useful since in principle we do not know the sign of $w_{ij}$. The positive and negative strengths are given by

$$w_i^{+} = \sum_j w_{ij}^{+}, \tag{7}$$

$$w_i^{-} = \sum_j w_{ij}^{-}, \tag{8}$$

and the positive and negative total strengths by

$$2w^{+} = \sum_i w_i^{+} = \sum_i \sum_j w_{ij}^{+}, \tag{9}$$

$$2w^{-} = \sum_i w_i^{-} = \sum_i \sum_j w_{ij}^{-}. \tag{10}$$

Consequently,

$$w_i = w_i^{+} - w_i^{-} \tag{11}$$

and

$$2w = 2w^{+} - 2w^{-}. \tag{12}$$

With these definitions at hand, the connection probabilities with positive and negative weights are respectively

$$p_i^{+} = \frac{w_i^{+}}{2w^{+}}, \tag{13}$$

$$p_i^{-} = \frac{w_i^{-}}{2w^{-}}. \tag{14}$$

Now, there are two terms which contribute to modularity: the first one takes into account the deviation of actual positive weights against a null case random network given by probabilities $p_i^{+}$, and the other is its counterpart for negative weights. Thus, it is useful to define

$$Q^{+} = \frac{1}{2w^{+}} \sum_i \sum_j \left( w_{ij}^{+} - \frac{w_i^{+} w_j^{+}}{2w^{+}} \right) \delta(C_i, C_j), \tag{15}$$

$$Q^{-} = \frac{1}{2w^{-}} \sum_i \sum_j \left( w_{ij}^{-} - \frac{w_i^{-} w_j^{-}}{2w^{-}} \right) \delta(C_i, C_j). \tag{16}$$

The total modularity must be a trade-off between the tendency of positive weights to form communities and that of negative weights to destroy them. If we want $Q^{+}$ and $Q^{-}$ to contribute to modularity proportionally to their respective positive and negative strengths, the final expression for modularity $Q$ is

$$Q = \frac{2w^{+}}{2w^{+} + 2w^{-}} Q^{+} - \frac{2w^{-}}{2w^{+} + 2w^{-}} Q^{-}. \tag{17}$$

An alternative equivalent form for modularity $Q$ is

$$Q = \frac{1}{2w^{+} + 2w^{-}} \sum_i \sum_j \left[ w_{ij} - \left( \frac{w_i^{+} w_j^{+}}{2w^{+}} - \frac{w_i^{-} w_j^{-}}{2w^{-}} \right) \right] \delta(C_i, C_j). \tag{18}$$

The main properties of Eq. (18) are the following: without negative weights, the standard modularity is recovered; modularity is zero when all nodes are together in one community; and it is antisymmetric in the weights, i.e. $Q(C, \{w_{ij}\}) = -Q(C, \{-w_{ij}\})$.

The extension to directed networks is simply obtained by the substitutions in Eq. (18) of

$$w_i^{\pm} \rightarrow w_i^{\pm,\mathrm{out}} = \sum_k w_{ik}^{\pm}, \tag{19}$$

$$w_j^{\pm} \rightarrow w_j^{\pm,\mathrm{in}} = \sum_k w_{kj}^{\pm}. \tag{20}$$

B. Mesoscales analysis for weighted signed networks

The extension of the multiple resolution method to the general case of weighted signed networks follows the same original idea. The method relies on the introduction of a magnitude $r$ that we call resistance, represented by a self-link for each node, that stands for the opposition of a node to belong to a group, in the sense of modularity. We tune the resistance uniformly for all nodes because in this way the functional form of the strength distribution is preserved and the relative structural properties of nodes are not distorted. More precisely, the formulation of modularity $Q_r$ at different resolution scales tagged by $r$ consists in substituting in Eq. (18)

$$w_{ij}^{\pm} \rightarrow w_{ij}^{\pm} + r^{\pm} \delta_{ij}, \tag{21}$$

$$w_i^{\pm} \rightarrow w_i^{\pm} + r^{\pm}, \tag{22}$$

$$2w^{\pm} \rightarrow 2w^{\pm} + N r^{\pm}, \tag{23}$$

where

$$r = r^{+} - r^{-} \tag{24}$$

and

$$r^{+} = \max\{0, r\}, \tag{25}$$

$$r^{-} = \max\{0, -r\}. \tag{26}$$



The topological scale determined by maximizing $Q$, at which the detection of community structure has been attacked so far, corresponds to $r = 0$ (Newman's scale). For positive values of $r$, we have access to the substructure below $r = 0$, and for negative values of $r$ we have access to the superstructures. For negative values of $r$, the resistance should be understood as an affinity of nodes to belong to the same group; with the substitutions above the formulation is still preserved, but not the semantics in terms of probabilities. The main challenge in this new scenario is that the limiting cases of $r$, which correspond to the partition into individual nodes and to the whole network as a unique module, have to be computed using the new modularity formulation, Eq. (18).
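
The signed modularity of Eq. (18), together with the resistance substitutions of Eqs. (21)-(26), can be sketched in a few lines of Python. This is only an illustrative sketch for undirected networks, not the authors' toolbox: the names `signed_modularity` and `two_triangles`, the dense matrix representation and the convention of dropping a null-case term whose total strength is zero are assumptions of this example.

```python
def signed_modularity(weights, communities, r=0.0):
    """Q_r of Eq. (18) with the substitutions (21)-(26), undirected case."""
    n = len(weights)
    r_pos, r_neg = max(0.0, r), max(0.0, -r)            # Eqs. (25)-(26)
    # Positive and negative strengths shifted by the resistance, Eq. (22)
    s_pos = [sum(max(0.0, x) for x in row) + r_pos for row in weights]
    s_neg = [sum(max(0.0, -x) for x in row) + r_neg for row in weights]
    tot_pos, tot_neg = sum(s_pos), sum(s_neg)           # Eq. (23)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] != communities[j]:
                continue
            w_ij = weights[i][j] + (r if i == j else 0.0)   # Eq. (21)
            null = 0.0
            if tot_pos > 0:      # assumption: skip an empty positive part
                null += s_pos[i] * s_pos[j] / tot_pos
            if tot_neg > 0:      # assumption: skip an empty negative part
                null -= s_neg[i] * s_neg[j] / tot_neg
            q += w_ij - null
    return q / (tot_pos + tot_neg)


def two_triangles(bridge):
    """Hypothetical toy network: two unit-weight triangles joined by one
    link of weight `bridge` between nodes 2 and 3."""
    w = [[0.0] * 6 for _ in range(6)]
    for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
        w[a][b] = w[b][a] = 1.0
    w[2][3] = w[3][2] = bridge
    return w


partition = [0, 0, 0, 1, 1, 1]
q_signed = signed_modularity(two_triangles(-1.0), partition)
```

With a negative bridge of weight $-1$, the partition into the two triangles gives $Q = 0.5$ at $r = 0$, larger than the value $5/14$ obtained when the bridge is positive, as expected when the negative link pushes the two groups apart. The three properties listed after Eq. (18) can be checked numerically with the same function.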



C. Resistance limiting cases for weighted signed networks

Here we present the mathematical proofs of the physical limiting cases of the resistance for weighted signed networks. Let us call $r_{\max}$ the limit of resistance for which all nodes are isolated in communities of size 1, and $r_{\min}$ the limit for which all nodes become members of a single group that represents the whole network. To determine $r_{\max}$ we look for a value of the resistance such that the increment in modularity when joining any pair of vertices in the same community is negative, and the contrary for $r_{\min}$. The idea is the following: if $r > 0$ and all the non-diagonal terms ($i \neq j$) of Eq. (18) are negative,

$$w_{ij} < \frac{(w_i^{+} + r)(w_j^{+} + r)}{2w^{+} + Nr} - \frac{w_i^{-} w_j^{-}}{2w^{-}}, \quad \forall i \neq j, \tag{27}$$

then the maximum of $Q_r$ is achieved with the partition which satisfies $\delta(C_i, C_j) = 0$ for all $i \neq j$, i.e. the partition in which all nodes are isolated. Eqs. (27) form a system of second order inequalities in $r$. After some algebra, it can be shown that $r_{\max}$ is the lowest value of $r$ for which the following set of inequalities per link (denoted $ij$) is satisfied:

$$\min_{r, ij} \left[ A r^2 + B_{ij} r + C_{ij} < 0 \right], \tag{28}$$

where

$$A = -2w^{-}, \tag{29}$$

$$B_{ij} = N (2w^{-} w_{ij} + w_i^{-} w_j^{-}) - 2w^{-} (w_i^{+} + w_j^{+}), \tag{30}$$

$$C_{ij} = 2w^{-} 2w^{+} w_{ij} + 2w^{+} w_i^{-} w_j^{-} - 2w^{-} w_i^{+} w_j^{+}. \tag{31}$$

Equivalently, if $r < 0$ and all the non-diagonal terms ($i \neq j$) of Eq. (18) are positive,

$$w_{ij} > \frac{w_i^{+} w_j^{+}}{2w^{+}} - \frac{(w_i^{-} - r)(w_j^{-} - r)}{2w^{-} - Nr}, \quad \forall i \neq j, \tag{32}$$

the maximum of $Q_r$ is achieved with the partition which satisfies $\delta(C_i, C_j) = 1$ for all $i \neq j$, i.e. the partition in which all nodes are together in the same community. Thus, to determine a lower bound of $r_{\min}$ we look for the largest value of $r$ satisfying

$$\max_{r, ij} \left[ A r^2 + B_{ij} |r| + C_{ij} > 0 \right], \tag{33}$$

where

$$A = 2w^{+}, \tag{34}$$

$$B_{ij} = N (2w^{+} w_{ij} - w_i^{+} w_j^{+}) + 2w^{+} (w_i^{-} + w_j^{-}), \tag{35}$$

$$C_{ij} = 2w^{+} 2w^{-} w_{ij} - 2w^{-} w_i^{+} w_j^{+} + 2w^{+} w_i^{-} w_j^{-}. \tag{36}$$

The value of $r$ obtained from Eqs. (33) is only a lower bound of the exact $r_{\min}$, since these equations are only sufficient conditions for the existence of a unique community holding all the nodes of the network (not all terms in Eq. (18) need to be positive in the $r_{\min}$ limit). On the other hand, Eqs. (28) are necessary and sufficient conditions, and thus the $r_{\max}$ found is the exact value.

The method to unveil the mesoscales of a complex network consists in optimizing $Q_r$ for $r$ in $[r_{\min}, r_{\max}]$. Different values of $r$ will eventually reveal different optimal partitions (found by heuristic algorithms to detect community structure) that represent intermediate topological scales of the complex network. We have applied this method to study the mesoscales in synthetic structured networks and real complex networks.

D. Validation of the method in synthetic networks

In Fig. 2 we have screened the whole range of topological scales for three synthetic networks, representing the number of modules obtained at the optimal partition for $Q_r$ and plotting in a matrix the superposition of scales found. More precisely, any graphical representation of the whole mesoscale should take into account, for every pair of nodes, the frequency of mesoscales at which they belong to the same community. Each mesoscale has a natural length defined by the range of resistances $[r_{\mathrm{from}}, r_{\mathrm{to}}]$ at which it is optimal:

$$\mathrm{length} = \log(r_{\mathrm{to}} - r_{\min}) - \log(r_{\mathrm{from}} - r_{\min}). \tag{37}$$

Thus, the length frequency for a pair of nodes is the sum of the lengths corresponding to mesoscales in which they belong to the same community, normalized by the total length. The graphical representation of this table is the frequency mesoscales matrix. First we have computed the modular structure in a hierarchical scale-free network with 125 nodes, RB 125, proposed by Ravasz and Barabasi. We clearly observe persistent structures in 5 and 25 communities respectively, which account for the most significant subdivisions in the process, showing two hierarchical levels for the structure.

Another network example used is the H 13-4 network, which corresponds to a network homogeneous in degree with two predefined hierarchical levels: 256 is the number of nodes, 13 the number of links of each node with the most internal community (formed by 16 nodes), 4 the number of links with the most external community (four groups of 64 nodes), and 1 more link with any other node at random in the network. Both hierarchical levels are revealed by the method as they correspond to the original construction of the network: the first hierarchical level consists of 4 groups of 64 nodes, and the second level of 16 groups of 16 nodes.

Finally, we have used the FB network proposed by Fortunato and Barthelemy to demonstrate the resolution limit of modularity (at $r = 0$). It consists of two cliques of 20 nodes linked with two small cliques of 5 nodes. At $r = 0$ the best partition cannot separate the two small cliques. We observe that the partition sought by the authors, formed by the four cliques isolated in their own communities, is obtained by increasing the resolution $r$, showing that the resolution limit of modularity is overcome by the method.

The optimization of modularity in all these cases has been performed using existing heuristics found in the literature [11, 14, 16] and compiled in a free toolbox available at the authors' webpage.

FIG. 2. Frequency mesoscales matrices in synthetic complex networks. We have computed the topological mesoscales for three synthetic networks. Left, we plot the networks and right we present their mesoscales matrices. The different color levels correspond to the superposition of the structures in $r$, which account for the persistence of the partitions revealed. See text for details.
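
To make the upper resistance limit of Sec. II C more tangible, the sketch below treats only the special case of a network without negative weights. In that case (an assumption of this example, not a statement from the text) the negative-strength fraction in Eq. (27) is taken to be absent, and the condition for each pair reduces to $r^2 + (w_i + w_j - N w_{ij})\, r + (w_i w_j - 2w\, w_{ij}) > 0$. The function names are hypothetical, and the value returned is the largest root over all pairs, i.e. the resistance above which every off-diagonal term stays negative.

```python
import math


def r_max_unsigned(weights):
    """Resistance above which all off-diagonal terms of Q_r are negative,
    for a symmetric matrix with non-negative weights (special case)."""
    n = len(weights)
    s = [sum(row) for row in weights]      # strengths w_i
    total = sum(s)                         # 2w
    r_max = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Quadratic r^2 + b r + c > 0 obtained from the pair condition
            b = s[i] + s[j] - n * weights[i][j]
            c = s[i] * s[j] - total * weights[i][j]
            disc = b * b - 4.0 * c
            if disc >= 0.0:
                r_max = max(r_max, (-b + math.sqrt(disc)) / 2.0)
    return r_max


def off_diagonal_terms_negative(weights, r):
    """Direct check of w_ij < (w_i + r)(w_j + r) / (2w + N r) for i != j."""
    n = len(weights)
    s = [sum(row) for row in weights]
    total = sum(s)
    return all(
        weights[i][j] < (s[i] + r) * (s[j] + r) / (total + n * r)
        for i in range(n) for j in range(n) if i != j
    )


# Hypothetical toy network: two triangles joined by the link 2-3.
w = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    w[a][b] = w[b][a] = 1.0

r_top = r_max_unsigned(w)
```

For this toy network the limit is set by the links between two nodes of strength 2, giving $r_{\max} = 1 + \sqrt{11} \approx 4.32$; slightly above this value the direct check succeeds, and slightly below it fails.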



III. APPLICATION TO EXPLORATORY DATA 
ANALYSIS 

Exploratory data analysis stands for the approach to data analysis in which some rather general assumptions are used to reveal information about the data in a kind of inverse hypothesis testing. In our particular scenario, we will analyze the structure of the neural connectivity of the nematode C. elegans using this approach. We do not aim at an exhaustive biological classification of all the functionalities that are related to the topology, but rather at showing the applicability of the mesoscales analysis described before. A fairly exhaustive analysis of the same system has been recently presented [27] for the scale corresponding to $r = 0$. The whole nervous system of the nematode is composed of 302 neurons whose anatomical and connectivity description is completely known. The resulting network is represented as a weighted directed adjacency matrix, see Fig. 3. We will assume that those groups of nodes that are more persistent throughout the screening of the mesoscales of the topology have some functional role, and afterwards we will look for this role in the current biological literature.

FIG. 3. Connectivity matrix of C. elegans neuronal network.

FIG. 4. Newman's scale of the C. elegans neuronal network. Left, original order, right, reordering by communities.

The original data is a weighted and directed network, composed of 306 vertices (302 neurons + WE, WI, WM and WN) and 2359 arcs. We have discarded nine disconnected nodes from the network; the remaining 297 neurons form a single connected component and will be the subject of our analysis.

We have discretized the resistance range into 1000 non-uniform intervals, in such a way that the last resistance increment is ten times larger than the first one, and the size of the increments grows at a constant rate. The significant Newman's scale $r = 0$ has been added. The negative values of the resistance have been discarded, since we are interested only in the sub-structure beyond the standard Newman's scale.
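
One possible reading of this discretization is sketched below. The text does not say whether the "constant rate" is multiplicative or additive; the sketch assumes a geometric growth of the increments, and the function name `resistance_grid` and the numerical range used in the example are hypothetical.

```python
def resistance_grid(r_low, r_high, n_intervals=1000, ratio=10.0):
    """Grid of n_intervals + 1 resistances between r_low and r_high whose
    increments grow geometrically, the last being `ratio` times the first.
    Assumes n_intervals >= 2 and ratio > 0."""
    q = ratio ** (1.0 / (n_intervals - 1))      # growth factor per step
    span = r_high - r_low
    if q == 1.0:
        first = span / n_intervals              # uniform grid
    else:
        # Geometric series: first * (q^n - 1) / (q - 1) = span
        first = span * (q - 1.0) / (q ** n_intervals - 1.0)
    grid = [r_low]
    step = first
    for _ in range(n_intervals):
        grid.append(grid[-1] + step)
        step *= q
    grid[-1] = r_high                           # remove accumulated rounding
    return grid


# Hypothetical range [0, 5]; the actual r_max depends on the network.
grid = resistance_grid(0.0, 5.0)
```

The special value $r = 0$ is already the first grid point here because the example starts at zero; for a range that straddles zero it would have to be inserted explicitly.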



FIG. 5. Mesoscales of the C. elegans: number of clusters in the optimal partition at every value of the topological scale defined by $\log(r - r_{\min})$, where $r_{\min}$ refers to the exact value, not its lower bound. Highlighted in a circle, we represent the scale that most contributes to the frequency matrix.



The order of the neurons in the matrix follows that in Watts and Strogatz, obtained from experimental data by White et al. The detection of the mesoscales in this neuronal system has been performed according to the method explained in the previous section. The best partition at $r = 0$, corresponding to the original Newman's scale, provides 5 communities. The representation of the obtained groups is depicted in Fig. 4 (left). This figure does not allow the observation of relevant information because of the original order of the neurons in Fig. 3; however, after ordering the neurons in the matrix by their communities, the representation shown in Fig. 4 (right) emerges.

The coarse graining at $r = 0$ thus provides a large scale in this system; hence our interest has been especially focused on the sub-structural levels, not on supra-structural levels, which means that we have analyzed the mesoscale for $r \in [0, r_{\max}]$, see the gray region of Fig. 5. We used the partition at $r = 0$ simply as a reference for sorting the neurons in the substructures found by the multiple resolution method.

Any attempt at classifying the functional role of neurons of C. elegans is extremely delicate because of their multifunctional aspects. Many neurons participate in different synaptic pathways, resulting in different functionalities. This property is also captured by our method, which shows that at different scales the same neuron can appear in different groups, i.e. the method is not necessarily hierarchical. However, to extract information from the results obtained, we use an ensemble of the different partitions found by screening $r$, and construct a frequency mesoscales matrix, indicating the relative persistence of each neuron in a particular community. By fixing a threshold in the frequency value, we are able to unravel sub-structural scales that correspond to groups of neurons involved in different functionalities at different time scales.

FIG. 6. Frequency matrix of C. elegans neuronal network thresholded at 0.6. We used a color scale (same as in Fig. 3) to plot the persistence of neurons in the same groups; darker values correspond to more persistent communities and, according to our hypothesis in the exploratory analysis, to specific functionalities.

The most interesting information is that provided at a large value of the frequency threshold, because in this case the substructures found will contain small groups of neurons whose activity response is topologically correlated; in particular, the highlighted scales in Fig. 5 are the ones that most contribute to the frequency matrix. We have studied the ensemble frequency matrix at a threshold value of 0.6, Fig. 6: the lengths below the threshold are discarded, and the connected components of the graph defined by the remaining lengths are found. We have chosen this threshold by requiring the sizes of the groups to be analyzed to be less than ten neurons. With this information at hand, and the wide description of each neuron found in the public database of C. elegans [30,31], we propose a tentative classification of some groups of neurons by functionality.
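
The construction of the frequency mesoscales matrix and the extraction of persistent groups described above can be sketched as follows. This is a minimal sketch under stated assumptions: each mesoscale is given as a hypothetical triple (r_from, r_to, partition), its length is taken as the difference of $\log(r - r_{\min})$ at the two ends of its range, and pairs whose frequency equals the threshold are kept.

```python
import math


def frequency_matrix(mesoscales, r_min):
    """Fraction of the total mesoscale length during which each pair of
    nodes shares a community. `mesoscales` is a list of
    (r_from, r_to, partition) triples with r_from > r_min."""
    n = len(mesoscales[0][2])
    freq = [[0.0] * n for _ in range(n)]
    total = 0.0
    for r_from, r_to, partition in mesoscales:
        length = math.log(r_to - r_min) - math.log(r_from - r_min)
        total += length
        for i in range(n):
            for j in range(n):
                if partition[i] == partition[j]:
                    freq[i][j] += length
    return [[value / total for value in row] for row in freq]


def persistent_groups(freq, threshold):
    """Connected components of the graph that keeps only the pairs whose
    frequency is at least `threshold`."""
    n = len(freq)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and freq[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(component))
    return groups


# Hypothetical example: four nodes, three mesoscales of equal log-length.
scales = [
    (0.0, 1.0, [0, 0, 0, 0]),
    (1.0, 3.0, [0, 0, 1, 1]),
    (3.0, 7.0, [0, 1, 2, 3]),
]
freq = frequency_matrix(scales, r_min=-1.0)
groups = persistent_groups(freq, threshold=0.6)
```

In this example nodes 0 and 1 share a community during two thirds of the total length, as do nodes 2 and 3, so a threshold of 0.6 yields the groups [0, 1] and [2, 3].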

Our purpose, after the identification of individual functionalities, has been to assign a specific action to the more persistent groups of neurons. The classification obtained (see appendix) is not intended to be exact, but to provide biologists with useful information for future research.



IV. APPLICATION TO THE UNSUPERVISED 
CLASSIFICATION OF DATA 

Unsupervised classification of data (or data clustering) stands for the process of grouping patterns of data according to their similarity. A pattern is a vector of features (usually understood as a point in a multidimensional space) that describes the item we wish to classify. The goal of the process of data clustering is to organize these patterns into groups, in such a way that patterns in the same group are more alike than patterns in different groups.

FIG. 7. Feature vectors for the Iris data set. Color correspondence: setosa-blue, versicolor-red, and virginica-green.

The problem of data clustering has been the subject of interest in many disciplines where the mining of raw information is crucial to understand some phenomenon or gain insight into a system. Typical processes where data clustering is used are pattern analysis, decision-making, machine learning and image segmentation. These subjects have interesting applications such as targeted marketing, biological taxonomy and detecting communities of interest in the World Wide Web [22].

The methodology used to obtain the clusters from the raw data is as follows. First of all, a representation of the patterns has to be chosen, and a feature selection or extraction is performed. Feature selection means choosing, from all the available features, those that will make the process of clustering easier, leaving the redundant, correlated and less informative features out of the analysis. Feature extraction, on the other hand, consists in transforming the original dataset into a new one containing only the most relevant information. This first step is very important, as the result of the clustering often depends directly on its quality. Secondly, the similarity or dissimilarity between each pair of patterns has to be computed, which is often done by defining a measure of distance. The result of this step is the similarity matrix, which using the mapping to complex networks can be understood as a graph, where each node is a pattern and the links represent the similarity. Finally comes the main step of the process, the grouping (or clustering) algorithm, which decomposes the similarity matrix and returns the groups of data.

FIG. 8. Two principal components of the PCA analysis on the Iris dataset. Color correspondence: setosa-blue, versicolor-red, and virginica-green. The separation of pattern classes seems clearer in this projection.

In our approach, the algorithm used to classify the similarity matrix is the multiple resolution algorithm based on modularity explained previously in this document. Given the nature of this algorithm, the result will not be a single partition into clusters, but a collection of different partitions. This fact deserves a reflection about how to evaluate the quality of the output obtained. If we screen the resistance parameter between its minimum and maximum values to obtain every topological scale of resolution of the network, each one of these resolution levels will provide us with a partition into clusters. Then the question is, which one of these partitions is the right one? The answer is that every one of them is right, since what we are doing is analyzing the network at different levels of resolution, and all the information obtained through this process is found in the structure of the network. Having pointed that out, the problem of choosing the right partition translates into that of choosing the most relevant partitions. The most relevant partitions in our scope are those that persist unchanged over the largest intervals of values of the resistance parameter.
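
The persistence criterion can be illustrated with a short sketch that scans the partitions found along the resistance axis and ranks the plateaus on which the partition does not change. The input format (a list of (r, partition) pairs sorted by r), the function names and the use of the plain difference in r as plateau length are assumptions of this example.

```python
def canonical(partition):
    """Relabel communities by order of first appearance, so that two
    labelings of the same grouping compare equal."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in partition)


def plateaus(samples):
    """Return (length, r_start, r_end, partition) for each run of
    consecutive samples with the same partition, longest first."""
    runs = []
    for r, partition in samples:
        key = canonical(partition)
        if runs and runs[-1][2] == key:
            runs[-1][1] = r                  # extend the current plateau
        else:
            runs.append([r, r, key])         # start a new plateau
    ranked = [(end - start, start, end, key) for start, end, key in runs]
    return sorted(ranked, key=lambda item: item[0], reverse=True)


# Hypothetical screening of four patterns at six resistance values.
samples = [
    (0.0, [0, 0, 0, 0]),
    (0.5, [1, 1, 0, 0]),
    (1.0, [0, 0, 1, 1]),
    (2.0, [5, 5, 7, 7]),
    (2.5, [0, 1, 2, 2]),
    (3.0, [0, 1, 2, 3]),
]
most_relevant = plateaus(samples)[0]
```

Here the two-community grouping persists from r = 0.5 to r = 2.0 regardless of how its labels are named, so it is returned as the most relevant partition.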

The benchmark dataset selected to perform the data clustering is the Iris flower dataset, presented by Sir Ronald Aylmer Fisher in 1936. This dataset consists of 150 patterns corresponding to three different classes of flowers: Setosa, Versicolor and Virginica. Four features, the width and length of petal and sepal, form each pattern. Plots of the cross-variables and types of flowers are represented in Fig. 7. The unsupervised classification of this dataset is a major challenge in artificial intelligence and statistical theory because of the organization of the patterns: while one of the classes is linearly separable and thus easy to classify by any elementary classification algorithm, the other two classes are not linearly separable and are consequently far more difficult to classify.

FIG. 9. Number of clusters as a function of the resolution parameter of the classification method (see text for details).

Following the steps of data clustering explained above, we first performed a feature extraction/selection process. The idea here is simply to follow the workflow of any clustering problem, where the high dimensionality of the data and its redundancy are a main concern. In the particular case we analyze, we could use all the original data with no computational stress; however, we propose to address the feature extraction using PCA, which will be the most common approach in many scenarios. We performed the principal component analysis of the four features that form each pattern, and chose to work with the two principal components corresponding to the largest part of the data variance. In Fig. 8 a representation of these two components is shown. Based on these two variables, we propose to build up a similarity matrix from the Euclidean distances between pattern components with respect to the center of mass of the data set in this space. For any pair of flowers $i$ and $j$, we define the similarity $S_{ij} = \bar{d} - \|x_i - x_j\|$, where $\bar{d}$ stands for the average distance of the set, and $\|\cdot\|$ is the Euclidean distance between the feature vectors of each flower. The resulting similarity matrix is interpreted as a weighted network whose communities will, in principle, reproduce the right clustering of the data.
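
The similarity defined above can be sketched as follows for patterns that have already been projected onto their principal components. Two details are assumptions of this example rather than statements of the text: the average distance is taken over all distinct pairs, and the diagonal is set to zero so that no self-links are created. The function name and the four sample points are hypothetical.

```python
import math


def similarity_matrix(points):
    """S_ij = mean pairwise distance minus the Euclidean distance between
    patterns i and j; the diagonal is set to 0 (assumption)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    n_pairs = n * (n - 1) / 2
    mean = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / n_pairs
    return [
        [0.0 if i == j else mean - dist[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical two-dimensional patterns forming two well-separated pairs.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
S = similarity_matrix(points)
```

Patterns closer than average get a positive weight and patterns farther than average a negative one (for instance, $S_{01} = 6$ and $S_{03} = -4$ in this example), so the resulting network is in general signed, which fits the signed formulation of modularity given in Sec. II.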

The results of the multiple resolution algorithm on the two main components of the Iris dataset are shown in Fig. 9. It can be observed that the longest plateau in terms of the resistance interval values is that formed by those partitions that divide the dataset into two communities. This is not surprising, as we know beforehand that one of the three classes of flowers is linearly separable, and so this partition makes total sense: there is one community for the Setosa class and another one containing the Versicolor and Virginica. However, the second longest plateau is the one formed by the three-community partitions, and if we analyze the most resistant of them, we realize that it largely corresponds to the biological taxonomy of the flowers. To be specific, if we calculate the success as the number of correctly classified nodes divided by the total number of nodes, we achieve for the most resistant partition into three communities a 94.6% success rate compared to the correct biological taxonomy.
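
The success measure used above needs a correspondence between the communities found and the known classes, which the text does not spell out. The sketch below assumes the most favorable one-to-one assignment of communities to classes, found by exhaustive search (feasible for three classes); the function name and the nine-pattern example are hypothetical.

```python
from itertools import permutations


def success_rate(true_labels, found_labels):
    """Fraction of correctly classified patterns under the best one-to-one
    assignment of found communities to true classes. Assumes there are no
    more communities than classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(found_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[f] == t for t, f in zip(true_labels, found_labels))
        best = max(best, hits)
    return best / len(true_labels)


# Hypothetical example: one pattern of the second class is misplaced.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]
rate = success_rate(truth, found)
```

In this example eight of the nine patterns are matched, so the success rate is 8/9.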



Summarizing, we have presented a possible application of the multiple resolution method to the problem of data clustering. Our proposal has proved competitive in success with other techniques used in the literature on the same benchmark, but as an essential difference we also provide information about grouping at different scales of resolution that are invisible to other algorithms. The methodology presented so far can plausibly be extended to any data clustering problem expressed in terms of similarity matrices.



V. CONCLUSIONS 

Scientists working in the field of complex networks have developed tools for the analysis of structural information embedded in the topological connectivity matrix. Especially interesting are the heuristic algorithms intended to find the community structure of networks, which are reminiscent of the kind of data clustering problems found in many disciplines. Here we have presented a possible application of community detection algorithms to help exploratory analysis and data clustering. In particular, we have used a methodology previously proposed by the authors that allows for a multiple resolution of topological scales in the substructure of networks.

The exploratory analysis of the neural connectivity of the nematode C. elegans has been presented. We found a tentative classification of groups of neurons presumably involved in specific tasks, according to the persistence of these groups in the topological analysis. We have also shown the applicability of the method to the unsupervised classification of data, using the famous Iris dataset as a benchmark. The results are encouraging: we observe the full spectrum of clusters according to the organization of data, and the most persistent scales are those corresponding to well-known facts about its structure, namely a partition into two linearly separable groups and a partition into three groups corresponding to the biological taxonomy. These results open the field of applicability of the theory of complex networks to other problems where the representation of data as a network allows the use of the technology developed so far.



Appendix A: Functional groups of C. elegans 

Classification of functional groups of neurons resulting from the multiple resolution method. Using the WormAtlas database and the results depicted in Fig. 5, we have identified nine groups of neurons of size lower than ten, whose functionality can be tentatively related to a specific action. The process of assigning a tentative function to the groups of neurons has been done manually, reading the associated literature and using the WormAtlas database. We present the list in Table I.

TABLE I. Tentative functionality of several significant groups of neurons found in the mesoscale.

| Cluster of neurons | Tentative function |
|---|---|
| RIAL, RIAR, RMDR, RMDVR, SMDVR, RMDDL, SMDDR | Nose/Head orientation movement. |
| IL1DR, IL1VR, IL2DR, IL2VR, RIPR | Head-withdrawal reflex, more related to dorsal relaxation. When worms are touched on either the dorsal or ventral sides of their nose with an eyelash, they interrupt the normal pattern of foraging and undergo an aversive head-withdrawal reflex. |
| IL2, IL2R, OLQVL, OLQVR, RIH | Head-withdrawal reflex, more related to ventral relaxation. |
| ADLR, AIBR, ASEL, ASHR, AWCL, AWCR, AIAR, AIYL | Olfactory and thermosensation reflex. |
| ASGL, ASJL, ASKL, AIAL, PVQL | Chemotaxis to lysine reflex. |
| DB1, DB2, DD1, VB2, VD2, AS3, DA2, DA3, DA4, DA5 | Backward sinusoidal movement of the worm, more related to touch stimulus. |
| AVAL, AVAR, AVBL, AVBR, AVDL, AVDR, AVEL, AVER, DA1, FLPL | Forward and backward sinusoidal movement of the worm, more related to search for food in starving case, involving the social feeding effect. |
| AVHL, AVHR, AVJL, AVFL, AVFR | Impossible to determine from the experimental data available. There is no specific function known for any of these neurons. |
| AVKL, ACKR, PDEL, PDER, PVM, DVA, WN | The functionality of this group could be related to a relaxation state similar to a sleep state, with reduced motor activity, decreased sensory threshold, characteristic posture and easy reversibility, basically mediated by PDs neurons. |